Abstract

Aim: The aim of this research was to predict the presence of polyneuropathy in patients with type 2 diabetes mellitus using data mining algorithms.
Material and Method: The dataset was obtained from the Bilecik Public Hospital and contains 2907 instances. Models were built with two classification algorithms. The dataset includes Gender, Glycated Haemoglobin (HbA1c), Creatinine, Total Cholesterol, Low-Density Lipoprotein (LDL) and High-Density Lipoprotein (HDL). Numerical data were transformed into intervals, and the percentage of patients in each interval was calculated.
Results: Data analysis and performance evaluation were performed with R and RStudio. Random Forest was the best algorithm for predicting polyneuropathy (accuracy = 0,922547332185886); the accuracy of C4.5 was 0,920826161790017. For HbA1c, 6% of patients were at normal levels, 22% at impaired fasting glucose levels and 72% at type 2 diabetes mellitus levels. Creatinine was low in 2%, normal in 86% and high in 12% of patients. Total Cholesterol was at desirable levels in 46%, borderline levels in 31% and high levels in 23%. LDL was at optimal levels in 30%, near optimal levels in 31%, borderline high levels in 23%, high levels in 12% and very high levels in 4%. HDL was at bad levels in 43%, better levels in 42% and best levels in 19%. The model indicated that Creatinine, LDL and HbA1c are the three primary determinants of polyneuropathy disease. Furthermore, the model produced the following 5 rules. Rule 1: If Creatinine <0,6 mg/dL and HbA1c <5,7 mmol/L then polyneuropathy disease is available for male; if Creatinine <0,5 mg/dL and HbA1c <5,7 mmol/L then polyneuropathy disease is available for female. Rule 2: If Creatinine <0,5 mg/dL and HbA1c >5,7 mmol/L and 160 < LDL <189 mg/dL and LDL <189 mg/dL then polyneuropathy disease is unavailable. Rule 3: If Creatinine <0,5 mg/dL and HbA1c >5,7 mmol/L and 160 < LDL < 189 mg/dL and LDL >189 mg/dL, then polyneuropathy disease is unavailable. Rule 4: If Creatinine <0,5 mg/dL and HbA1c >5,7 mmol/L and 160 < LDL < 189 mg/dL, then polyneuropathy disease is unavailable. Rule 5: If Creatinine >0,5 mg/dL then polyneuropathy disease is unavailable.
Discussion: The results show that HDL, Gender and Total Cholesterol have no significant effect on polyneuropathy disease in this model. To determine the presence of polyneuropathy disease with the given data mining algorithms, researchers may consider the Creatinine, LDL and HbA1c scores.
Introduction

Nowadays, with the rapid growth of information technologies,
every piece of patient data is recorded and stored in databases.
These bulk data contain useful knowledge for both profit and
non-profit organizations. However, transforming the data into
reliable and meaningful knowledge that meets practitioners'
expectations is a major challenge. Data mining addresses this
challenge [1]. Its main purpose is to produce intelligible and
useful knowledge by analyzing the hidden patterns and
relationships between given attributes [2]. With the growing use
of Internet technologies in almost all fields, data mining has
attracted increasing attention and use in both profit and
non-profit organizations. In this way, data mining modeling has
opened a new era in healthcare and medicine [3]. Research on
data mining shows that it can help healthcare and medicine
improve diagnosis and patient services, prevent the spread of
chronic diseases, use the health budget more efficiently, and
reduce corruption [4]. Data mining in healthcare and medicine is
referred to as health informatics, a combination of
bioinformatics, clinical informatics, public health informatics,
and neuroinformatics. These fields supply data to health
informatics, which analyzes the data to create meaningful
information and stores it for future use [5].

Healthcare data mining is based on clinical records and is used
to determine risk factors and prevalence [6]. Accordingly, data
mining algorithms need both qualitative and quantitative data.
These algorithms cluster and classify the given data to predict
or estimate an outcome. Classification is used to help prevent
the mortality risk related to cancer, diabetes mellitus and
cardiovascular diseases [7]. In addition, patients' printed
documents and electronic records are vitally important for
gaining information about the correlations between a disease and
its causes. For this purpose, text mining and clustering
algorithms are used to reveal these relationships [8]. The data
mining algorithms used in healthcare can thus be divided into 2
subgroups: classification algorithms and clustering algorithms.
Classification algorithms include decision trees, k-nearest
neighbor, neural networks, support vector machines, Naive Bayes
and logistic regression. The most widely used clustering
algorithm is k-means. These algorithms are applied in specific
software to obtain knowledge [9].

The post-modern era encourages over-consumption. Consequently,
chronic diseases such as diabetes mellitus are also increasing,
and they are a major problem that leads to mortality all over
the world [10]. A very large number of people worldwide have
diabetes mellitus [11]. Hence, it is important to estimate the
prevalence of diabetes mellitus via data mining. To obtain data
on diabetes mellitus, a data warehouse including electronic
records, clinical data, test results, personal information,
heritability and interviews with patients is essential [12].


Material and Method 
Seven variables related to diabetes mellitus were used to
predict diabetic polyneuropathy: Diabetic Polyneuropathy,
Gender, Glycated Haemoglobin (HbA1c), Creatinine, Total
Cholesterol, High-Density Lipoprotein (HDL), and Low-Density
Lipoprotein (LDL). The data were obtained from a local health
institution and include 2907 type 2 diabetes mellitus patients.
The variables are as follows:


a- Diabetic Polyneuropathy: 

Diabetic neuropathies are a heterogeneous group of pathological
manifestations with the potential to affect every organ, with
clinical implications such as organ dysfunction, which leads to
reduced quality of life and increased morbidity. Diabetic
polyneuropathy (DPN) is defined as peripheral nerve dysfunction
with positive and negative symptoms. Risk factors include age,
male gender, duration of diabetes, uncontrolled glycemia,
height, overweight and obesity, and insulin treatment [13].


b- Gender: 

There is increasing evidence that sex and gender differences
are important in epidemiology, pathophysiology, treatment,
and outcomes in many diseases, but they appear to be
particularly relevant for noncommunicable diseases. Sex
differences describe biology-linked differences between women
and men, which are caused by differences in sex chromosomes,
sex-specific gene expression of autosomes, sex hormones, and
their effects on organ systems. Both biological and
psychosocial factors are responsible for sex and gender
differences in diabetes risk and outcome [10].


c- Glycated Haemoglobin (HbA1c): 
HbA1c is a blood test that measures the average blood glucose 
level over the previous 3-4 months [14]. 


d- Creatinine: 

Creatinine is a waste product of muscle metabolism that is
normally removed by the kidneys. The presence of excess
creatinine is an indication of increased muscle breakdown or a
disruption of kidney function [14].


e- Total Cholesterol: 

Total cholesterol is measured in terms of milligrams (mg) per
deciliter (dL) of blood. A milligram is equal to one-thousandth
of a gram. A deciliter is equal to one-tenth of a liter.
Desirable levels are below 200 mg/dL. Borderline high levels are
200-239 mg/dL. High levels are 240 mg/dL and above [14].


f- Low-Density Lipoprotein (LDL): 

LDL is referred to as bad cholesterol because excess quantities
of LDL contribute to plaque buildup in the arteries. Optimal
levels are below 100 mg/dL. Near optimal is between 100 and
129 mg/dL. Borderline high level is between 130 and 159 mg/dL.
High level is between 160 and 189 mg/dL. Very high level is
190 mg/dL and above [14].


g- High-Density Lipoprotein (HDL): 
HDL is referred to as good cholesterol because it carries
unneeded cholesterol back to the liver for processing and does
not contribute to plaque buildup. Bad levels are below 40 mg/dL.
Better levels are between 40 and 59 mg/dL. Best levels are
60 mg/dL and above [14].


Statistical Analysis 

All variables were analyzed with two data mining algorithms,
the C4.5 decision tree and the random forest, to predict
diabetic polyneuropathy. The algorithms were applied in the R
programming environment, and the findings were recorded.


a- C4.5 Decision Tree

A decision tree is a classifier expressed as recursive partition 
of the instance space. The decision tree consists of nodes that 
branch within a rooted tree. It starts with a root at the top that 
has no incoming edges. A node with outgoing edges is called 
an internal node, and all the other nodes are called leaves, also 
known as decision nodes. Each leaf is assigned to one class 
representing the majority target value at that node [15]. 
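
C4.5 is generally described as choosing the attribute tested at each internal node by the gain ratio, that is, the information gain of a split divided by its split information. The following is a minimal Python sketch of that criterion, not the R code used in this study; the toy rows, the attribute names and the function names are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, attribute, target):
    """Information gain of splitting on `attribute`, divided by split information."""
    labels = [row[target] for row in rows]
    # Group the target labels by the value of the candidate attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    # Weighted entropy that remains after the split.
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    # Split information: entropy of the attribute values themselves.
    split_info = entropy([row[attribute] for row in rows])
    if split_info == 0:
        return 0.0  # the attribute has a single value and cannot split the data
    return (entropy(labels) - remainder) / split_info


# Hypothetical toy rows, not taken from the study dataset.
toy_rows = [
    {"creatinine": "low", "gender": "m", "dpn": "available"},
    {"creatinine": "low", "gender": "f", "dpn": "available"},
    {"creatinine": "normal", "gender": "m", "dpn": "unavailable"},
    {"creatinine": "normal", "gender": "f", "dpn": "unavailable"},
]
```

On these toy rows, splitting on `creatinine` separates the two classes completely, whereas splitting on `gender` gives no information, so a gain-ratio criterion would place `creatinine` at the root.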

b- Random Forest Tree 

The RF algorithm, which is widely used for classification in
bioinformatics, builds nTree (a parameter) Random Trees (RT)
during its training phase. This involves randomizing the
training set in two ways for each RT: First, the training set is
resampled with replacement, maintaining the original size of the
dataset. As a second source of randomness for building an RT,
the search for the best feature to split the set of instances at
each RT node considers a randomly chosen feature subset of
size mtry (a parameter), typically much smaller than the
original feature set’s size. The instances at the current node
are then split into two subsets according to a condition based
on the values of the selected feature, creating two child nodes.
This split aims to increase the similarity of classes within each
instance subset and to decrease class similarity across the
subsets. Next, the algorithm recurses in each instance subset
until a stopping criterion is met [16].
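
The two sources of randomness described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this study: the function names, the seed and the stand-in instances are hypothetical, and the value mtry = 2 is taken from the random forest result reported later in this paper.

```python
import random


def bootstrap_sample(instances, rng):
    """Resample with replacement, keeping the original dataset size."""
    return [rng.choice(instances) for _ in instances]


def candidate_features(features, mtry, rng):
    """Choose the random feature subset examined at a single tree node."""
    return rng.sample(features, mtry)


rng = random.Random(42)  # arbitrary seed, chosen only for reproducibility
features = ["Gender", "HbA1c", "Creatinine", "Total Cholesterol", "HDL", "LDL"]
instances = list(range(10))  # stand-in for patient rows

sample = bootstrap_sample(instances, rng)      # first source of randomness
subset = candidate_features(features, 2, rng)  # second source, with mtry = 2
```

Each Random Tree would be grown on its own `sample`, drawing a fresh `subset` at every node before searching for the best split.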


Results 
Following the data mining process, these steps were applied
[17]:


Pre-processing:

The dataset contained a number of noisy records that could
affect the modelling process. The noise was caused by incorrect
non-numerical entries in the columns. To handle these invalid
values, a cleaning step was applied and some of the records were
removed from the dataset.


Table 1. Transformation of HbA1c Dataset

HbA1c                      Accepted HbA1c   Number of Patients   Percentage
                           Intervals        out of 2907          (%)
Normal                     x<5,7            166                  6
Impaired fasting glucose   [5,7-6,4]        641                  22
Diabetes Mellitus Type 2   x>6,4            2100                 72


Table 2. Transformation of Creatinine Dataset

Creatinine   Accepted Creatinine   Accepted Creatinine    Number of Patients   Percentage
             Intervals for Male    Intervals for Female   out of 2907          (%)
Low          x<0,6                 x<0,5                  42                   2
Normal       [0,6-1,2]             [0,5-1,1]              2492                 86
High         x>1,2                 x>1,1                  373                  12


Table 3. Transformation of Total Cholesterol Dataset

Total Cholesterol   Accepted Total          Number of Patients   Percentage
                    Cholesterol Intervals   out of 2907          (%)
Desirable levels    x<200                   1346                 46
Borderline levels   [200-240]               893                  31
High levels         x>240                   663                  23


Table 4. Transformation of HDL Dataset

HDL Cholesterol   Accepted HDL         Accepted HDL           Number of Patients   Percentage
                  Intervals for Male   Intervals for Female   out of 2907          (%)
Bad levels        x<40                 x<50                   1257                 43
Better levels     [40-60]              [50-60]                1256                 42
Best levels       x>60                 x>60                   394                  19


Table 5. Transformation of LDL Dataset

LDL Cholesterol          Accepted LDL            Number of Patients   Percentage
                         Cholesterol Intervals   out of 2907          (%)
Optimal levels           x<100                   860                  30
Near optimal levels      [100-129]               888                  31
Borderline high levels   [130-159]               681                  23
High levels              [160-189]               350                  12
Very high levels         x>189                   128                  4


Table 6. Summary of the Data in R

Variable            Min.       1st Qu.    Median     Mean       3rd Qu.    Max.
Age                 16,00000   52,00000   60,00000   59,66598   68,00000   91,00000
HbA1c               6,00000    22,00000   72,00000   57,20605   72,00000   72,00000
Creatinine          2,00000    86,00000   86,00000   75,29137   86,00000   86,00000
Total cholesterol   23,00000   31,00000   31,00000   36,10698   46,00000   46,00000
HDL                 19,0000    42,0000    42,0000    39,3151    43,0000    43,0000
LDL                 4,00000    23,00000   30,00000   25,35363   31,00000   31,00000

Gender: 1 = 1167, 2 = 1740
Diabetic Polyneuropathy: Yes = 222, No = 2685


Table 7. The Accuracy of C4.5 Algorithm

              Estimation
Acceptance    Available   Unavailable
Available     1           43
Unavailable   3           534

Accuracy = 0,920826161790017


Table 8. The Accuracy of Random Forest Tree Algorithm

              Estimation
Acceptance    Available   Unavailable
Available     1           43
Unavailable   2           535

Accuracy = 0,922547332185886


Transformation of the Data:

Because of the nature of the data mining algorithms used, the
numerical values in the dataset were transformed into the
percentage values shown in Table 1, Table 2, Table 3, Table 4
and Table 5. The HbA1c dataset was divided into 3 sub-groups.
The first is the normal group, the second is the impaired
fasting glucose group and the third is the diabetes mellitus
type 2 group. The accepted values are below 5,7 mmol/L for the
normal group, 5,7-6,4 mmol/L for impaired fasting glucose, and
6,4 mmol/L and above for diabetes mellitus type 2. Accordingly,
the normal sub-group of 166 patients was transformed into 6%,
the impaired fasting glucose sub-group of 641 patients into 22%,
and the diabetes mellitus type 2 sub-group of 2100 patients
into 72%.

The Creatinine dataset was divided into 3 sub-groups: low,
normal and high Creatinine. The accepted values for low
Creatinine are below 0,6 mg/dL for males and below 0,5 mg/dL
for females; for normal Creatinine, 0,6-1,2 mg/dL for males and
0,5-1,1 mg/dL for females; and for high Creatinine, 1,2 mg/dL
and above for males and 1,1 mg/dL and above for females. The
percentage of low Creatinine is 2%, of normal Creatinine 86%,
and of high Creatinine 12%.

The Total Cholesterol dataset was divided into 3 sub-groups:
desirable, borderline and high levels. The accepted values are
below 200 mg/dL for desirable levels, 200-240 mg/dL for
borderline levels, and 240 mg/dL and above for high levels. The
percentage of desirable levels is 46%, of borderline levels 31%,
and of high levels 23%.

The HDL dataset was divided into 3 sub-groups: bad, better and
best levels. The accepted values for bad levels are below
40 mg/dL for males and below 50 mg/dL for females; for better
levels, 40-60 mg/dL for males and 50-60 mg/dL for females; and
for best levels, 60 mg/dL and above for both males and females.
The percentage of bad levels is 43%, of better levels 42%, and
of best levels 19%.

The LDL dataset was divided into 5 sub-groups: optimal, near
optimal, borderline high, high and very high levels. The
accepted values are below 100 mg/dL for optimal levels,
100-129 mg/dL for near optimal levels, 130-159 mg/dL for
borderline high levels, 160-189 mg/dL for high levels, and
189 mg/dL and above for very high levels. The percentage of
optimal levels is 30%, of near optimal levels 31%, of
borderline high levels 23%, of high levels 12%, and of
very high levels 4%.
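
The transformation can be illustrated with a short Python sketch for HbA1c. It assumes that each percentage is the sub-group count divided by 2907 and rounded to the nearest integer, which reproduces the HbA1c values in Table 1 but is not confirmed as the rule used for every table. The function names are hypothetical, and assigning the boundary value 6,4 to the middle interval is likewise an assumption.

```python
def hba1c_group(value):
    """Map an HbA1c value to its Table 1 interval label."""
    if value < 5.7:
        return "Normal"
    if value <= 6.4:  # assumption: the boundary 6,4 belongs to the middle interval
        return "Impaired fasting glucose"
    return "Diabetes Mellitus Type 2"


def percentage_codes(counts):
    """Convert sub-group counts into rounded percentages of the whole dataset."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


# Patient counts per sub-group as reported in Table 1 (2907 patients in total).
table1_counts = {
    "Normal": 166,
    "Impaired fasting glucose": 641,
    "Diabetes Mellitus Type 2": 2100,
}
codes = percentage_codes(table1_counts)
```

Under these assumptions `codes` holds 6, 22 and 72, and every patient's HbA1c value would be replaced by the code of the group returned by `hba1c_group`.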


Processing 

Data were analyzed in R with the classification algorithms, and
a summary of the data is shown in Table 6. In R, the random
starting point was fixed with the “set.seed()” function.
Afterward, the dataset was divided into a training set and a
test set. The training set is used to fit the algorithms and
the test set is used to measure their accuracy. In addition to
accuracy, the confusion matrix was used to evaluate the
performance of the algorithms.
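
A seeded split of this kind might look as follows in Python. This is a sketch rather than the R code used in the study: the 80/20 proportion is an assumption inferred from the 581 test instances in Tables 7 and 8 and the 2326 classified instances mentioned in the Discussion, and the seed value and function name are hypothetical.

```python
import random


def train_test_split(rows, train_fraction, seed):
    """Shuffle a copy of the rows reproducibly and cut it into two parts."""
    rng = random.Random(seed)  # plays the role of set.seed() in R
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(2907))  # stand-in for the 2907 patient records
train, test = train_test_split(rows, 0.8, seed=123)
```

With 2907 rows and a fraction of 0.8, the split yields 2326 training rows and 581 test rows, and repeating the call with the same seed returns the same partition.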


Data Mining Process for C4.5

The results of data mining with the C4.5 algorithm are shown in
Figure 1 and the accuracy is shown in Table 7. According to the
results, the accuracy is 92%. The model output is:

J48 pruned tree

Creatinine <= 2
|   HbA1c <= 6: available (2.0)
|   HbA1c > 6
|   |   LDL <= 12
|   |   |   LDL <= 4: unavailable (3.0)
|   |   |   LDL > 4: available (6.0/1.0)
|   |   LDL > 12: unavailable (24.0/1.0)
Creatinine > 2: unavailable (2291.0/170.0)

Number of Leaves: 5
Size of the tree: 9
[Figure: decision tree with Creatinine at the root, followed by
HbA1c and LDL; terminal nodes: Node 3 (n = 2), Node 6 (n = 3),
Node 7 (n = 6), Node 8 (n = 24), Node 9 (n = 2291)]

Figure 1. Data mining with C4.5.


According to this output, the rules of the C4.5 model are as
follows:

Rule 1: If Creatinine <= 2 and HbA1c <= 6, then diabetic
polyneuropathy: available.

Rule 2: If Creatinine <= 2 and HbA1c > 6 and LDL <= 12 and
LDL <= 4, then diabetic polyneuropathy: unavailable.

Rule 3: If Creatinine <= 2 and HbA1c > 6 and LDL <= 12 and
LDL > 4, then diabetic polyneuropathy: available.

Rule 4: If Creatinine <= 2 and HbA1c > 6 and LDL > 12, then
diabetic polyneuropathy: unavailable.

Rule 5: If Creatinine > 2, then diabetic polyneuropathy:
unavailable.
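
The five rules can be written as one nested conditional. The Python sketch below simply transcribes them; the function name is hypothetical, and its inputs are the percentage codes from Tables 1, 2 and 5, not raw laboratory values.

```python
def classify(creatinine, hba1c, ldl):
    """Apply the five C4.5 rules to percentage-coded inputs."""
    if creatinine <= 2:
        if hba1c <= 6:
            return "available"        # Rule 1
        if ldl <= 12:
            if ldl <= 4:
                return "unavailable"  # Rule 2
            return "available"        # Rule 3
        return "unavailable"          # Rule 4
    return "unavailable"              # Rule 5
```

For example, `classify(2, 72, 12)` follows Rule 3 and returns "available", while any input with a Creatinine code above 2 falls under Rule 5.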


Data Mining Process for Random Forest Tree 

In the Random Forest algorithm, 500 trees were built. The
number of variables tried at each split is 2. The confusion
matrix and the accuracy are shown in Table 8. The accuracy
is 92%.
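
Both reported accuracies follow from the confusion matrices as the share of test instances on the main diagonal. The Python sketch below recomputes them; the function name is hypothetical, and the lower-left count of the random forest matrix, which is illegible in the extracted Table 8, is taken to be 2 because that is the value that brings the total to 581 instances and reproduces the reported accuracy.

```python
def accuracy(matrix):
    """Share of instances on the main diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Both classes are listed in the order: available, unavailable.
c45_matrix = [[1, 43], [3, 534]]  # Table 7
rf_matrix = [[1, 43], [2, 535]]   # Table 8; the count 2 is inferred, see above

c45_accuracy = accuracy(c45_matrix)  # 535 / 581
rf_accuracy = accuracy(rf_matrix)    # 536 / 581
```

The two models differ by a single test instance, which is why their accuracies both round to 92%.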
Discussion

According to the results, there are 5 rules for the presence
of polyneuropathy disease. By rule 1, if Creatinine <= 2 and
HbA1c <= 6, then polyneuropathy disease is available. To
interpret this condition, we refer to the Creatinine table
(Table 2). In this expression, 2 is a percentage, and in the
real data set it refers to values below 0,6 mg/dL for males and
0,5 mg/dL for females. Likewise, the HbA1c score refers to
5,7 mmol/L. Rule 1 is therefore:

If Creatinine < 0,6 mg/dL and HbA1c < 5,7 mmol/L then
polyneuropathy disease is available for male.

If Creatinine < 0,5 mg/dL and HbA1c < 5,7 mmol/L then
polyneuropathy disease is available for female.

Considering rule 2 and the related tables, we obtain the
following formula:

If Creatinine < 0,5 mg/dL and HbA1c > 5,7 mmol/L and 160 <
LDL < 189 mg/dL and LDL < 189 mg/dL then polyneuropathy
disease is unavailable.

Considering rule 3, the formula is:

If Creatinine < 0,5 mg/dL and HbA1c > 5,7 mmol/L and 160 <
LDL < 189 mg/dL and LDL > 189 mg/dL, then polyneuropathy
disease is unavailable.

Considering rule 4, the formula is:

If Creatinine < 0,5 mg/dL and HbA1c > 5,7 mmol/L and 160 <
LDL < 189 mg/dL, then polyneuropathy disease is unavailable.

Considering rule 5, the formula is:

If Creatinine > 0,5 mg/dL then polyneuropathy disease is
unavailable.

These rules lead to the conclusion that Creatinine, LDL and
HbA1c are the three primary determinants of polyneuropathy
disease. In the given dataset, however, Gender, HDL and Total
Cholesterol have no significant relationship with the
prediction of polyneuropathy disease.

Another key result of this research is the accuracy of the
C4.5 classification algorithm. Previous studies on diabetes
mellitus have reported the accuracy of their C4.5 models.
Lakshmi and Kumar [18] used the C4.5 classification algorithm
to predict diabetes mellitus and found an accuracy of 72%.
Radha and Srinivasan [19] applied the C4.5 algorithm to a
clinical data set to predict diabetes mellitus and found an
accuracy of 86%. Furthermore, Devi and Shyla [20] used the C4.5
algorithm to predict diabetes mellitus and also found an
accuracy of 86%. In our study, the accuracy was
0,920826161790017. Compared with these studies, the C4.5
algorithm here runs with a high accuracy of 92%. The model
estimates 2154 of 2326 classified instances correctly.
Moreover, the random forest builds 500 trees and has almost the
same accuracy as C4.5: its accuracy is 0,922547332185886, which
again corresponds to a high accuracy of 92%.
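
The figure of 2154 correct out of 2326 classified instances can be reproduced from the leaf annotations of the J48 output, where a leaf marked (n/e) was reached by n instances, e of which were misclassified. The Python sketch below performs that arithmetic; reading the last leaf as (2291.0/170.0) is an assumption, since the extracted output is garbled at that point.

```python
# (instances reaching the leaf, instances misclassified) for the five leaves.
leaves = [(2, 0), (3, 0), (6, 1), (24, 1), (2291, 170)]

classified = sum(n for n, _ in leaves)     # instances covered by the tree
misclassified = sum(e for _, e in leaves)  # errors summed over all leaves
correct = classified - misclassified
leaf_accuracy = correct / classified       # about 0.926
```

This ratio is computed on the instances used to build the tree, so it is a separate quantity from the test-set accuracy reported in Table 7.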
Consequently, to determine the presence of polyneuropathy
disease with the given data mining algorithms, researchers may
consider the Creatinine, LDL and HbA1c scores. However, the set
of variables is the main limitation of this study: the model
estimates polyneuropathy disease with 6 variables. In the
future, the model should be tested with additional variables
and the results compared with the current outcomes. Another
limitation is generalization. The results are valid only for
the sample studied, and outcomes could differ in other
populations.